Using a co-similarity approach on a large scale text categorization task
Abstract
This paper presents a framework we developed for the second Large Scale Hierarchical Text Categorization challenge (LSHTC2). The main idea is to propose a method that copes with the variability of terms across categories, so that similarities can be found between collections of documents that belong to the same category but share few terms. To this end, we used a co-similarity based approach, named χ-Sim, that we introduced in previous work. However, because co-similarity methods do not scale well, we needed a "divide and conquer" approach that splits the categories into a set of clusters containing semantically related documents. This leads to a two-stage strategy for document categorization: first, we decide which cluster the test document belongs to; then, inside the selected cluster, we perform the final categorization, which is based on our co-similarity approach.
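The two-stage strategy described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout of `clusters`, and the toy data are all assumptions, and plain cosine similarity on term counts stands in for the χ-Sim co-similarity used in the paper's second stage.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(docs):
    """Sum the term counts of several documents into one vector."""
    total = Counter()
    for d in docs:
        total.update(d)
    return dict(total)


def two_stage_classify(doc, clusters):
    """Hypothetical layout: clusters = {cluster_id: {category: [doc_vectors]}}."""
    # Stage 1: pick the cluster whose pooled documents best match the test document.
    cluster_centroids = {
        cid: centroid([d for docs in cats.values() for d in docs])
        for cid, cats in clusters.items()
    }
    best_cluster = max(cluster_centroids,
                       key=lambda cid: cosine(doc, cluster_centroids[cid]))

    # Stage 2: inside that cluster only, pick the best category.
    # ASSUMPTION: cosine is a stand-in here; the paper uses co-similarity.
    cats = clusters[best_cluster]
    best_cat = max(cats, key=lambda c: cosine(doc, centroid(cats[c])))
    return best_cluster, best_cat


# Toy data, invented for illustration.
clusters = {
    "sports": {
        "football": [{"goal": 2, "match": 1}],
        "tennis": [{"racket": 2, "match": 1}],
    },
    "science": {
        "physics": [{"quantum": 2, "energy": 1}],
        "biology": [{"cell": 2, "gene": 1}],
    },
}
result = two_stage_classify({"racket": 1, "match": 1}, clusters)
```

The point of the split is cost: the expensive similarity computation in stage 2 only ever runs over the categories of one cluster rather than over the whole hierarchy.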
Similar resources
A New Co-similarity Measure : Application to Text Mining and Bioinformatics. (Une Nouvelle Mesure de Co-Similarité : Applications aux Données Textuelles et Génomique)
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and there exist a multitude of different clustering algorithms for different settings. As datasets become larger and more varied, adaptations of existing algorithms are required to maintain the quality of clus...
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
A Fuzzy Similarity Approach in Text Classification Task
We present a fuzzy similarity approach to solve a text categorization problem. The effectiveness of various fuzzy conjunction and disjunction operators used in fuzzy similarity formula and several document representations were evaluated using test sets from three text document collections. Based on empirical results obtained from using these collections, a special case of the fuzzy similarity f...
Text Categorization Using Word Similarities Based on Higher Order Co-occurrences
In this paper, we propose an extension of the χ-Sim co-clustering algorithm to deal with the text categorization task. The idea behind the χ-Sim method [1] is to iteratively learn the similarity matrix between documents using the similarity matrix between words, and vice versa. Thus, two documents are said to be similar if they share similar (but not necessarily identical) words and two words are simila...
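The iterative idea in this snippet, where document similarities are computed from word similarities and vice versa, can be illustrated with a small sketch. It is a simplified reading of the description above, not the published χ-Sim algorithm: the normalization by the product of row sums, the forced unit diagonal, the function names, and the toy matrix are assumptions made for the example.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def normalized_similarity(m, s):
    """Similarity between the rows of m, given similarity s between its columns.

    ASSUMPTION: each entry is (m_i . s . m_j) / (sum(m_i) * sum(m_j)),
    and the diagonal is set to 1.
    """
    raw = matmul(matmul(m, s), transpose(m))
    sums = [sum(row) for row in m]
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                out[i][j] = 1.0
            elif sums[i] and sums[j]:
                out[i][j] = raw[i][j] / (sums[i] * sums[j])
    return out


def co_similarity(m, iterations=2):
    """Return (document similarity, word similarity) for a documents x words matrix."""
    mt = transpose(m)
    sr = identity(len(m))   # document-document similarity
    sc = identity(len(mt))  # word-word similarity
    for _ in range(iterations):
        # Each matrix is refreshed from the other one's previous value.
        sr, sc = normalized_similarity(m, sc), normalized_similarity(mt, sr)
    return sr, sc


# Toy documents x words matrix: documents 0 and 2 share no word,
# but both share a word with document 1.
m = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
sr1, _ = co_similarity(m, iterations=1)
sr2, _ = co_similarity(m, iterations=2)
```

In this toy case, documents 0 and 2 have zero similarity after one iteration and a positive similarity after two, because their words have become similar through document 1. That is the higher-order co-occurrence effect the snippet refers to.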
Optimization of Text Classification Using Supervised and Unsupervised Learning Approach
Text classification, also known as text categorization, is the task of automatically allocating unlabeled documents to one or more predefined categories or classes. The ability to accurately perform a classification task depends on the representations of the documents to be classified. Text representations transform the textual docum...